-
Notifications
You must be signed in to change notification settings - Fork 4.6k
client: Change connectivity state to CONNECTING when creating the name resolver #8710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
client: Change connectivity state to CONNECTING when creating the name resolver #8710
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #8710 +/- ##
==========================================
- Coverage 83.28% 83.25% -0.04%
==========================================
Files 416 419 +3
Lines 32267 32433 +166
==========================================
+ Hits 26874 27002 +128
- Misses 4019 4047 +28
- Partials 1374 1384 +10
🚀 New features to boost your workflow:
|
dial_test.go
Outdated
|
|
||
| func (s stringerVal) String() string { return s.s } | ||
|
|
||
| const errResolverBuildercheme = "test-resolver-build-failure" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
resolver_wrapper.go
Outdated
| // https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md | ||
| // defines CONNECTING as follows: | ||
| // - The channel is trying to establish a connection and is waiting to | ||
| // make progress on one of the steps involved in name resolution, TCP | ||
| // connection establishment or TLS handshake. This may be used as the | ||
| // initial state for channels upon creation. | ||
| // | ||
| // We are starting the name resolver here as part of exiting IDLE, so | ||
| // transitioning to CONNECTING is the right thing to do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO comments should be short and to the point.
Short comments make the code take up less space, which makes it easier to read and understand. Long comments make long functions extremely long and not fit on the page.
Honestly, I think a comment for this action isn't even necessary. But if you think we need one, this could be:
// Set state to CONNECTING before building the name resolver
// so the channel does not remain in IDLE.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| if state := cc.GetState(); state != connectivity.Idle { | ||
| t.Fatalf("Expected initial state to be IDLE, got %v", state) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The AwaitState above already tested this IIUC
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| // Ensure that the client is in IDLE before connecting. | ||
| ctx, cancel := context.WithTimeout(context.Background(), defaultTestTimeout) | ||
| defer cancel() | ||
| testutils.AwaitState(ctx, t, cc, connectivity.Idle) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't need an Await right? It should just check the current state, and never wait for changes, as we know it starts idle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true. Moved the check for the current state to here, and got rid of the Await.
| for _, wantState := range wantStates { | ||
| waitForState(ctx, t, stateCh, wantState) | ||
| if wantState == connectivity.Idle { | ||
| tt.exitIdleFunc(ctx, cc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this test the actual RPC error when we use an RPC to exit idle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I changed the test a little to make that happen.
But as part of doing that, I realized a thing or two:
- The RPC is not failing with a status error, because the
idle.Manager.ExitIdleModewhich is called when an RPC has to kick the channel out ofIDLE, does not embed the error returned fromClientConn.ExitIdleMode. But if we make it embed the error, we will have to return a status error from the latter, which is doable. But that brings the following questions:- What code do we return? I'm torn between
UnavailableandInternal, and leaning towards the latter - This would also make
Dialfail with a status error which I find a little odd.
- What code do we return? I'm torn between
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe Unavailable and then it's treated the same as if the resolver did a ReportError immediately instead of failing to build? But probably we should see what the C++/Java lame/failing channels do before deciding, since that seems like the most equivalent scenario in languages where the resolver can't fail to build -- it just doesn't exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to Unavailable. Also changed the idleness manager code to embed the error, so that the RPC can see the status code.
This does mean that the error returned from Dial in this scenario would be a status error. But I guess that's an OK thing to do, unless you feel otherwise. In which case, we can explicitly return a new error here and thereby remove the embedding:
Line 266 in 76c67d1
| return nil, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would prefer to have the idlenessmgr continue to return the resolver error directly, and have the RPC path return an Unavailable status with that error as the message.
Now that I look more closely, what happens to the second RPC after this happens, with the current state of your code? It seems it will only do anything with the idlenessMgr on the initial exit from idle (here, which short-circuits if not idle here), and nothing ever produces a picker when the initial exit from idle mode fails, so I think the channel will just go to the picker wrapper and wait for a picker until the RPC times out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to not embedding the error from the idleness manager and handling the status code in the rpc path.
Also, the second RPC works just fine (i.e., that also returns Unavailable just like the first one), because when exiting IDLE fails, we undo the idle entry process. So, the next RPC behaves just like the first RPC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does that mean? I thought the goal was to stay in TF until the idle timer expires, not go immediately back to IDLE and rebuild the resolver on the next call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does that mean?
By "we undo the idle entry process" I meant the internals of the idleness manager in terms of the state being tracked etc. We will move to TF at the channel level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I'm confused.
How does the error get from the name resolver to the RPC after we've left idle. I don't see where we're saving it in an RPC picker, e.g., or where the idleness manager is storing the "last error encountered while trying to leave idle" so that it can return it on subsequent RPCs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something about this approach seems wrong to me.
I think ExitIdleMode should be infallible. Instead, if the resolver builder fails, the channel should still leave idle mode, and it should make sure its state is set such that future RPCs will fail.
| // Tests the case where the resolver reports an error to the channel before | ||
| // reporting an update. Verifies that the channel eventually moves to | ||
| // TransientFailure and a subsequent RPC returns the error reported by the | ||
| // TransientFailure and a subsequent RPCs returns the error reported by the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/ a / /
| for _, wantState := range wantStates { | ||
| waitForState(ctx, t, stateCh, wantState) | ||
| if wantState == connectivity.Idle { | ||
| tt.exitIdleFunc(ctx, cc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I'm confused.
How does the error get from the name resolver to the RPC after we've left idle. I don't see where we're saving it in an RPC picker, e.g., or where the idleness manager is storing the "last error encountered while trying to leave idle" so that it can return it on subsequent RPCs.
| Authority: ccr.cc.authority, | ||
| MetricsRecorder: ccr.cc.metricsRecorderList, | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert this diff & file.
Fixes #7686
Current Behavior
Connectbeing called or because of an RPC), if name resolver creation fails, we stay in IDLE.New Behavior
Connectbeing called or because of an RPC), we have already moved to CONNECTING (because of the previous bullet point). If name resolver creation fails, we will move to TRANSIENT_FAILURE and start the idle timer and move back to IDLE when the timer firesRELEASE NOTES: